Machine learning with textural analysis of longitudinal multiparametric MRI and molecular subtypes accurately predicts pathologic complete response in patients with invasive breast cancer

Purpose To predict pathological complete response (pCR) after neoadjuvant chemotherapy using extreme gradient boosting (XGBoost) with MRI and non-imaging data at multiple treatment timepoints. Material and methods This retrospective study included breast cancer patients (n = 117) who underwent neoadjuvant chemotherapy. Data types used included tumor ADC values, diffusion-weighted and dynamic-contrast-enhanced MRI at three treatment timepoints, and patient demographics and tumor data. GLCM textural analysis was performed on MRI data. An extreme gradient boosting machine learning algorithm was used to predict pCR. Prediction performance was evaluated using the area under the curve (AUC) of the receiver operating curve along with precision and recall. Results Prediction using texture features of DWI and DCE images at multiple treatment time points (AUC = 0.871; 95% CI: (0.768, 0.974; p<0.001) and (AUC = 0.903 95% CI: 0.854, 0.952; p<0.001) respectively), outperformed that using mean tumor ADC (AUC = 0.850 (95% CI: 0.764, 0.936; p<0.001)). The AUC using all MRI data was 0.933 (95% CI: 0.836, 1.03; p<0.001). The AUC using non-MRI data was 0.919 (95% CI: 0.848, 0.99; p<0.001). The highest AUC of 0.951 (95% CI: 0.909, 0.993; p<0.001) was achieved with all MRI and all non-MRI data at all time points as inputs. Conclusion Using XGBoost on extracted GLCM features and non-imaging data accurately predicts pCR. This early prediction of response can minimize exposure to toxic chemotherapy, allowing regimen modification mid-treatment and ultimately achieving better outcomes.


Introduction
Neoadjuvant chemotherapy (NAC) [1] in the setting of locally advanced breast cancer can reduce tumor size, making breast conservation surgery feasible and obviating the need for mastectomy. Pathological complete response (pCR) is a desirable endpoint of NAC entailing no residual invasive tumor is present at surgery post NAC [2,3]. Patients with pCR are more likely to be candidates for breast-conserving surgery and to have longer progression-free and overall survival [2,3]. Therefore, pCR can be used as a surrogate for favorable outcome. The ability to predict pCR prior to treatment would help in determining which patients will benefit from NAC and which will not. Furthermore, predicting pCR can allow for changes in treatment regimens, maximizing the chances of pCR. Thus, the accurate prediction of pCR could help in identifying patients who are likely to respond to specific NAC drugs while enabling oncologists to alter treatments mid-course if needed in order to maximize successful outcomes while minimizing the adverse effects of unnecessary chemotherapy.
Currently, the efficacy of a chemotherapy regimen is tested invasively through core needle biopsy. Clinical biomarkers, such as Ki67, give a limited assessment of the entire tumor as they are obtained by core needle biopsy and therefore may not be representative of the entire tumor. Noninvasive imaging can overcome this problem of tumor heterogeneity, as the entire tumor is depicted on images [4]. In fact, the capacity of pre and post NAC MRI to depict response to treatment and predict pCR has already been demonstrated [5][6][7].
Machine learning (ML) has been used to predict eventual pCR. Radiomic features (such as volume, sphericity, DCE MRI signal of wash in and wash out) [8][9][10][11][12] and deep learning analysis of whole MR images [13], DCE dynamics [14] that include demographics and molecular receptor subtypes [15,16] have been used to predict pCR. ML analysis has also been applied to include diffusion MRI [17][18][19][20]. However, the role of diffusion MRI to predict pCR is understudied and the results remain controversial [21].
The goal of this study was to apply the extreme gradient boosting (XGBoost) algorithm to predict pCR using multiparametric MRI data along with non-imaging data at multiple treatment timepoints as inputs. Extreme gradient boosting (XGBoost) was chosen due to the relatively small sample size of the dataset, as XGBoost has been shown to be effective even with limited data. Furthermore, XGBoost has been proven to be effective at classifying with tabular data. This approach has the potential to non-invasively identify patients who are likely to respond to neoadjuvant chemotherapy at diagnosis or early treatment. This approach may prove useful for treatment planning, treatment execution, and mid-treatment adjustment to achieve better outcomes.

Data sources
In this retrospective study, data from the Breast Multiparametric MRI for prediction of NAC Response-2 (BMMR2) competition training dataset, which was curated from the ACRIN-6698 sub trial of the I-SPY 2 TRIAL (NCT Number: NCT01564368), was used to create machine learning models to predict pCR. Patients from the ACRIN-6698 multicenter trial were previously reported in a paper by Partridge et. Al titled "Diffusion-weighted MRI Findings Predict Pathologic Response in Neoadjuvant Treatment of Breast Cancer: The ACRIN 6698 Multicenter Trial", published in Radiology. Data collected in that trial was published as an open dataset. Our paper utilizes that data (n = 117 patients, of which 36 had pCR). While the previous paper attempted to predict pCR solely using DWI MRI with logistic regression as the model technique, we use extreme gradient boosting with multiple types of MRI data as well as patient demographic data to predict pCR. As the data containing the training and testing sets came from a public dataset, no IRB was required.
All 117 female patients in the dataset were diagnosed with invasive breast cancer and underwent 12 weeks of paclitaxel followed by 4 weeks of anthracycline treatment. Each patient sample contained collections of MRI images taken at 3 distinct timepoints, namely, tp0: pretreatment, tp1: 3 weeks post paclitaxel (± experimental agent), and tp2: 12 weeks post paclitaxel (± experimental agent).
Non-imaging data included patient age, race, lesion type (one of multiple masses, single mass or non-mass enhancement), hormone receptor status (hormone receptor (HR) positive/ negative, human epidermal growth factor receptor 2 (HER2) positive/negative), and Scarff-Bloom-Richardson (SBR) grade. Of the 117 patients, 12 patients had missing data (11 had an unknown race, and 1 had an unknown SBR grade). These values were filled using backward fill, which populates the missing data with the data from the next patient. There were no other missing data (Fig 1).
MRI data, including DWI and DCE MRI, were performed bilaterally in the axial orientation using a 1.5-or 3.0-T field strength magnet and a dedicated breast radiofrequency coil as described in [22]. Acquisition parameters are provided in the data source in our data availability statement. DCE images were aligned with their corresponding rectangular tumor masks with position metadata, which highlighted the tumor and the surrounding area. Textural features were extracted using the 3rd dynamic DCE data for all 3 time points in the rectangular region of interest mask. DWI images were aligned with manually defined binary segmentation data provided in the dataset. Features were calculated in the smallest rectangle that encompassed the segmented tumor using b = 800 s/mm 2 DWI imaging data. ADC tumor segmentations were carefully aligned with their corresponding ADC map with position metadata. Tumor ADC values, and changes in these values, were extracted for ML analysis. All images were interpolated to 0.7825 mm x 0.7825 mm voxel spacing with a 2.5 mm slice thickness, which was the median spacing and thickness for all patients. When applying machine learning techniques on multicenter data, data harmonization [23] may be necessary. However, we did not perform additional data harmonization as all MRI acquisitions were taken with ISPY2 acquisition requirements as described in [22]. Furthermore, we ensured that data from different field strengths and different molecular subtypes, etc., were not over-represented in either training and testing data sets by using 5-fold cross validation so that all data has the chance to be in both the test and training sets.

Ground truth
pCR determination was done using histopathologic analysis at study sites by institutional pathologists (blinded to functional tumor volume (FTV) and ADC MRI measures) according to the I-SPY 2 trial protocol using the residual cancer burden system. Following U.S. FDA rationale and guidelines, pCR was the reference standard for determining response to neoadjuvant chemotherapy in our study, defined and reported as no residual invasive disease in either breast or axillary lymph nodes after neoadjuvant therapy. Patients were categorized as having pCR or non-pCR based on postsurgical histopathologic examination findings.

Features used
Texture analysis was performed using a small bounding box enclosing the tumor as determined by functional tumor volume provided by ISPY-2 data. Calculations were performed using the Scikit-Image library in Python. Images were cast to 8-bits, and the number of bins was set to the maximum value of 256 to maximize the number of grey levels counted. GLCM features (Energy, Homogeneity, Contrast, Dissimilarity, ASM, and Correlation) were calculated at distances of 1, 3, and 5 pixels at angles of 0, p 2 , p 4 , and 7p 4 radians, representing all cardinal and ordinal directions. These features were calculated for all 3 treatment time points for both DCE and DWI MRI images, except for the 5-pixel direction for DWI images, as sometimes the segmentation was too small. Table 1 summarizes the significance of each feature and the formula used to calculate it. This resulted in a total of 360 GLCM features. Combined with 6 nonimaging patient data features, and 6 features extracted from ADC parametric map, the total number of features used was 372. Feature selection to reduce the number of features was unnecessary, as XGBoost inherently selects for the most important features when splitting leaves of decision trees, automatically ignoring irrelevant features.

Extreme gradient boosting models
A total of 13 XGBoost models were created using different combinations of data and time points. These XGBoost models were created in Python using the Scikit-Learn API. Minority oversampling was used to balance the frequency of each class in the data set by randomly oversampling the minority (pCR) class. The oversampled balanced dataset consisted of 162 patients with 50% pCR and 50% non-PCR outcomes.
Bayesian optimization along with a sequential domain reduction transformer was used to find optimal values of these hyperparameters. The mean AUC after 5-fold cross-verification was selected as the variable to maximize. 15 rounds of random exploration and 80 rounds of optimization were used. If the optimal value was an extreme of the bound, the bounds were adjusted, and optimization was run again.

Statistical analysis
Distributions of patient characteristics, such as the distribution of lesion types and race, were compared using the χ2 test for homogeneity. Patient age and maximum tumor diameter distributions between classes were compared using a t-test. Investigation of the receptor status characteristics were done using 2 sample z-test of proportions. F-scores were used to calculate the importance of each individual feature in predicting pCR. K-fold cross-validation is considered the gold standard for determining model performance after training. 5 folds of verification were used to designate 20% of the dataset as testing and 80% as training for each fold. All performance metrics were calculated as the mean value of each testing metric of the 5 folds after 1000 rounds of training. Standard error of the mean was calculated as the variability of this mean AUC.
Models were analyzed primarily through the area under the receiver operator curve (AUC). AUC has been shown to be the optimal method for comparing AI models. Precision and recall were secondary metrics for model performance. P-values < 0.05 were considered significant and were calculated as the probability that the null hypothesis is true.

Results
Patient characteristics and the sample sizes (n = 117) are described in Table 2. There were 36 patients with pCR and 81 without pCR. There was no significant difference in the mean age for patients with pCR (49.08 +/-10.31 years) and patients without pCR (49.0 +/-11.73 years) (p = 0.971). There was also no significant difference in the longest diameter of the tumor between classes (p = 0.252). The distributions of patient races (p = 0.205), lesion types (p = 0.409), and SBR grades (p = 0.488) showed no significant difference. There was a significant difference in distributions of receptor statuses (p<0.001), with the patient without pCR For pCR patients, regional tumor ADCs were 0.59±0.05, 0.81±0.04, and 1.11±0.06 x10 -3 mm 2 /s at pre-treatment, early treatment, and mid-treatment, respectively (Fig 6). For non-pCR patients, the corresponding tumor ADC values were 0.59±0.03, 0.74±0.03, and 0.91±0.05 x10 -3 mm 2 /s. ADC values for both PCR and non-PCR patients increased with treatment, with pCR patients showing a larger increase progressively.

PLOS ONE
Extreme gradient boosting accurately predicts pathological complete response in breast cancer We also evaluated the effects of using different combinations of treatment time points on prediction performance. For prediction models using all available MRI and non-MRI data, AUC using tp0, tp1, tp2, and tp0+tp1 were respectively, 0.918 (95% CI: 0.856, 0.98; p<0.001),    Table 4).

Discussion
This study aimed to predict pCR non-invasively using XGBoost models with multiparametric MRI data at multiple treatment timepoints along with non-imaging data. This paper is the first to apply an extreme gradient boosting algorithm to the ISPY-2 data set to predict pathological complete response, and the second ever to use such an algorithm for predicting NAC response. This paper is also the first ever to use GLCM features to do so.
In the original I-SPY2 study, Partridge et al. [22] used logistic regression to evaluate if change in tumor ADC values (not texture analysis) is predictive of PCR in breast cancer patients. They found that changes in tumor ADC values was moderately predictive of pCR at mid-treatment (AUC = 0.60; 95% CI: 0.52, 0.68). In a test subset, a model combining tumor subtype and mid-treatment changes in ADC significantly improved predictive performance (AUC = 0.72; 95% CI: 0.61, 0.83).
There have only been a few studies that have applied deep learning on MR images themselves as inputs to predict pCR. Braman et al. [7] applied deep learning to predict NAC in HER2 patients from pre-treatment 2D DCE MRI in a retrospective study. They explored both pre-contrast and late post-contrast phases of DCE MRI and found an AUC of 0.74 (± 0.03). Qu et al. [24] built a CNN model using pre and post-NAC. Tumor regions were manually segmented by two expert radiologists on enhanced T1-weighted images. They found an AUC of 0.553 (95% CI: 0.416, 0.683) for pre-NAC data.
There have also only been a few studies that have applied non-deep learning on MR images themselves as inputs to predict pCR. Suo et al. [17] evaluated mono-exponential (ADC), biexponential and stretched-exponential from diffusion MRI data to predict pCR. They also included tumor size and relative enhancement ratio from DCE MRI data at 3 time points: before treatment, at mid-treatment, and after treatment with NAC. They found that flowinsensitive ADC change at mid-treatment was the most predictive feature, with an AUC of 0.831 (95% CI: 0.747, 0.915; P < 0.001). Combining this with receptor statuses, the AUC increased to 0.905 (95% CI: 0.843, 0.966; P < 0.001). Bian et al. [25] analyzed radiomic signatures based on T2W imaging, diffusion-weighted imaging, dynamic contrast-enhanced imaging and their combination to predict pCR. Logistic regression was then used to assess the association between features and clinical risk factors. The combined radiomic signature and nomogram model achieved an AUC of 0.91 (95% CI: 0.86, 1.00). Chen et al. [26] similarly analyzed the ability of radiomic signatures extracted from MRI data to predict pCR. They achieved an AUC of 0.848 by combining DCE-MRI and ADC maps. Eun et al. [19] performed non-GLCM texture analysis of pre and mid-treatment T2-weighted MRI, DCE, DWI, and ADC

Limitations
This study was performed on a relatively small multi-center dataset. As more data releases, our findings need to be replicated on a larger dataset to improve the generalizability of the model as well as account for the racial bias towards white women in this dataset. We have only evaluated XGBoost, other machine learning methods such as LightGBM should also be explored. Finally, we have only extracted GLCM features from MRI imaging data. Using additional features such as radiomic textural features may improve the model and give it more perspectives on the images. We employed images as provided and performed only visual image quality checks. We did not perform distortion correction, eddy current correction, check for multisite scanner consistency, among others, as raw data were not available. It has been shown that accurate estimates and reproducibility of diffusion indices in clinical studies require careful data quality assurance [28]. Moreover, large multicenter studies have shown a non-negligible variability in quantitative diffusion indices measured using different scanner systems [29,30]. Therefore, MRI scanner system should be adequately characterized in diffusion-MRI of the breast [31]. Deep learning methods could also be applied to predict PCR [24,[32][33][34][35][36][37] (see review [38]) but were not analyzed.

Conclusions
XGBoost models using multiparametric MRI data along with demographic and molecular subtype data, accurately predict pCR to NAC. With further development and testing on larger multi-institutional sample sizes, this approach has the potential to non-invasively identify patients who are likely to respond to neoadjuvant chemotherapy at diagnosis or early treatment. This approach may prove useful for treatment planning, treatment execution, and midtreatment adjustment to achieve better outcomes.